On lexicon creation for Turkish LVCSR
Abstract
In this paper, we address the lexicon design problem in Turkish large vocabulary speech recognition. Although we focus only on Turkish, the methods described here are general enough to be considered for other agglutinative languages such as Finnish and Korean. In an agglutinative language, many words can be created from a single root word using a rich collection of morphological rules. Consequently, a virtually infinite lexicon would be required to cover the language if words were used as the basic units. The standard approach to this problem is to discover a number of primitive units so that a large set of words can be created by compounding those units. Two broad classes of methods are available for splitting words into their sub-units: morphology-based and data-driven methods. Although word splitting significantly reduces the out-of-vocabulary rate, it shrinks the context and increases acoustic confusability. We have used two methods to address these side effects. In the first, we use word counts to avoid splitting high-frequency lexical units; in the second, we recompound splits according to a probabilistic measure. We present experimental results showing that both methods are effective at lowering the word error rate, at the expense of lexicon size.
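The first method mentioned in the abstract (using word counts to avoid splitting high-frequency units) can be illustrated with a minimal sketch. The abstract does not give the splitting algorithm, the count threshold, or the probabilistic recompounding measure, so everything here is an assumption made for illustration: `naive_split` is a hypothetical stand-in that cuts a word after a fixed number of characters, `min_count` is an arbitrary threshold, and the recompounding step is not modeled at all.

```python
from collections import Counter


def naive_split(word, stem_len=2):
    """Hypothetical splitter (NOT the paper's method): treat the first
    stem_len characters as the stem and the rest as one suffix unit,
    marked with a leading '+'."""
    if len(word) <= stem_len:
        return [word]
    return [word[:stem_len], "+" + word[stem_len:]]


def build_lexicon(word_counts, min_count, split_fn=naive_split):
    """Keep words seen at least min_count times as whole units;
    replace rarer words with their sub-word units."""
    lexicon = set()
    for word, count in word_counts.items():
        if count >= min_count:
            lexicon.add(word)
        else:
            lexicon.update(split_fn(word))
    return lexicon


def segment(word, lexicon, split_fn=naive_split):
    """Use the whole word if the lexicon has it, otherwise split it."""
    return [word] if word in lexicon else split_fn(word)


def oov_rate(tokens, lexicon, split_fn=naive_split):
    """Fraction of recognition units (after segmentation) missing
    from the lexicon."""
    units = [u for w in tokens for u in segment(w, lexicon, split_fn)]
    if not units:
        return 0.0
    return sum(1 for u in units if u not in lexicon) / len(units)


# Toy training text: "evler" is frequent, the other forms are rare.
train = "evler evler evler evlerde evde evlerimiz".split()
lexicon = build_lexicon(Counter(train), min_count=2)
```

In this toy setup the frequent form `evler` stays a single lexical unit, while the rare forms contribute the shared stem `ev` plus suffix units. Raising `min_count` splits more words (smaller lexicon, shorter units); lowering it keeps more whole words, which is the lexicon-size trade-off the abstract refers to.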
Related resources
Lattice extension and rescoring based approaches for LVCSR of Turkish
In this paper, we present some techniques to solve the problems of Turkish Large Vocabulary Continuous Speech Recognition (LVCSR). Its agglutinative nature makes Turkish a challenging language in terms of speech recognition since it is impossible to include all possible words in the recognition lexicon. Therefore, data-driven sub-word recognition units, in addition to words, are used in a newsp...
A hybrid language model for open-vocabulary Thai LVCSR
This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers and the definition of word unit is often ambiguous due to the presence of compound words. Hence, to build open-vocabulary LVCSR, a very large lexicon is required to also handle word unit ambiguity. Pseudomorpheme (PM), a syllable-like sub-word unit specif...
Lexicon Optimization for WFST-Based Speech Recognition Using Acoustic Distance Based Confusability Measure and G2P Conversion
In this paper, we propose a lexicon optimization method based on a confusability measure (CM) to develop a large vocabulary continuous speech recognition (LVCSR) system with unseen words. When a lexicon is built or expanded for unseen words by using grapheme-to-phoneme (G2P) conversion, the lexicon size increases since G2P is generally realized by 1-to-N-best mapping. Thus, the proposed method ...
Towards better language modeling for Thai LVCSR
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes us...
Speech Input Acoustic Analysis Phoneme Inventory Pronunciation Lexicon
This paper gives an overview of an architecture and search organization for large vocabulary, continuous speech recognition (LVCSR at RWTH). In the first part of the paper, we describe the principle and architecture of an LVCSR system. In particular, the issues of modeling and search for phoneme based recognition are discussed. In the second part, we review the word conditioned lexical tree search...